Speed up LoadMode.DBT_LS by caching dbt ls output in Airflow Variable #1014
|
| """ | ||
| logger.info(f"Trying to parse the dbt project using dbt ls cache {self.cache_identifier}...") | ||
| if settings.enable_cache and settings.experimental_cache: | ||
| dbt_ls_cache = Variable.get(self.cache_identifier, "") |
It might be better to query the db directly, so you bypass any configured secrets backend.
@jedcunningham how do you advise us to do this?
Wouldn't there be a risk that with this, we'd create a larger coupling of Cosmos to Airflow that could be more sensitive to different versions of Airflow?
Jed, since this has been long standing, I'll be merging as it is - and I can make a follow up PR to address after your feedback.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1014      +/-   ##
==========================================
+ Coverage   95.82%   96.05%   +0.22%
==========================================
  Files          62       62
  Lines        3020     3196     +176
==========================================
+ Hits         2894     3070     +176
  Misses        126      126
Looks promising!
pankajkoti left a comment
Looks good to me. There are some minor cosmetic suggestions for the documentation that we can address iteratively in a subsequent PR.
Great feature support! Thank you! 👏🏽
| types-PyYAML, | ||
| types-attrs, | ||
| attrs, | ||
| types-pytz, |
| types-pytz, |
Would we still need it now?
| "pytest-cov", | ||
| "pytest-describe", | ||
| "sqlalchemy-stubs", # Change when sqlalchemy is upgraded https://docs.sqlalchemy.org/en/14/orm/extensions/mypy.html | ||
| "types-pytz", |
| "types-pytz", |
same
|
|
||
| Cosmos 1.5 introduced a feature to mitigate the performance issue associated with ``LoadMode.DBT_LS`` by caching the output | ||
| of this command as an `Airflow Variable <https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/variables.html>`_. | ||
| Based on an initial `analysis <https://github.com/astronomer/astronomer-cosmos/pull/1014>`_, enabling this setting reduced some DAGs ask queueing from 30s to 0s. Additionally, some users `reported improvements of 84% <https://github.com/astronomer/astronomer-cosmos/pull/1014#issuecomment-2168185343>`_ in the DAG run time. |
| Based on an initial `analysis <https://github.com/astronomer/astronomer-cosmos/pull/1014>`_, enabling this setting reduced some DAGs ask queueing from 30s to 0s. Additionally, some users `reported improvements of 84% <https://github.com/astronomer/astronomer-cosmos/pull/1014#issuecomment-2168185343>`_ in the DAG run time. | |
| Based on an initial `analysis <https://github.com/astronomer/astronomer-cosmos/pull/1014>`_, enabling this setting reduced some DAGs task queueing from 30s to 0s. Additionally, some users `reported improvements of 84% <https://github.com/astronomer/astronomer-cosmos/pull/1014#issuecomment-2168185343>`_ in the DAG run time. |
| Caching the partial parse file | ||
| ~~~~~~~~~~~~~ |
| Caching the partial parse file | |
| ~~~~~~~~~~~~~ | |
| Caching the partial parse file | |
| ~~~~~~~~~~~~~~~~~~~~~ |
| Caching the dbt ls output | ||
| ~~~~~~~~~~~~~ |
| Caching the dbt ls output | |
| ~~~~~~~~~~~~~ | |
| Caching the dbt ls output | |
| ~~~~~~~~~~~~~~~~~~~ |
| - Default: ``True`` | ||
| - Environment Variable: ``AIRFLOW__COSMOS__ENABLE_CACHE`` | ||
|
|
||
| .. enable_cache_dbt_ls: |
| .. enable_cache_dbt_ls: | |
| .. _enable_cache_dbt_ls: |
|
@tatiana A little late to this PR, but not late to 1.5.0: I'm thinking the I think, going into something like Cosmos 2.0, it makes sense to consolidate the
    class DbtGraph:
        load_method_mapping: dict[LoadMode, Callable[[], None]] = {}

        def __init__(
            self,
            project_config: ProjectConfig | None = None,
            profile_config: ProfileConfig | None = None,
            execution_config: ExecutionConfig | None = None,
            render_config: RenderConfig | None = None,
            **kwargs
        ):
as per the code example in #895. Adding more kwargs means either we cannot do this, or we need to add more deprecations. I think doing this will also help make the API cleaner, more consistent, and easier for users to reason about. Right now there are just a lot of things for users to tweak and it can be overwhelming. Keeping it consistent and locked inside the configs can reduce confusion. WDYT?
A few other notes:
|
Hi @dwreeves, Thanks a lot for all the very relevant feedback! I'm sorry I missed it, as it was made after the PR was merged.
We can expose them there as part of a follow-up PR / future version of Cosmos. Just added: #1110
I'm also happy with the proposal to refactor the
I created a PR to address the feedback on the dead code: #1111.
Your proposal to refactor the load implementation in #1001 is very good as well. Would you like to work on it?
I would, but I've been busy... still trying to find time to contribute to this project.
Significantly improve the LoadMode.DBT_LS performance. The example DAGs tested reduced the task queueing time significantly (from ~30s to ~0.5s) and the total DAG run time for Jaffle Shop from 1 min 25s to 40s (by more than 50%). Some users reported improvements of 84% in the DAG run time when trying out these changes. This difference can be even more significant on larger dbt projects.
The improvement was accomplished by caching the dbt ls output as an Airflow Variable. This is an alternative to #992, where we cached the pickled DAG/TaskGroup into a local file in the Airflow node. Unlike #992, this approach works well for distributed deployments of Airflow.
As with any caching solution, this strategy does not guarantee optimal performance on every run: whenever the cache is regenerated, the scheduler or DAG processor will experience a delay. It was also observed that the key value could change across platforms (e.g., Darwin and Linux). Therefore, in a deployment with heterogeneous operating systems, the key may be regenerated often.
Closes: #990
Closes: #1061
Enabling/disabling this feature
This feature is enabled by default.
Users can disable it by setting the environment variable AIRFLOW__COSMOS__ENABLE_CACHE_DBT_LS=0.
How the cache is refreshed
Users can purge or delete the cache via Airflow UI by identifying and deleting the cache key.
The cache will be automatically refreshed in case any files of the dbt project change. Changes are calculated using the SHA256 of all the files in the directory. Initially, this feature was implemented using the files' modified timestamp, but this did not work well for some Airflow deployments (e.g., astro --dags since the timestamp was changed during deployments).
Additionally, if any of the following DAG configurations are changed, we'll automatically purge the cache of the DAGs that use that specific configuration:
- ProjectConfig.dbt_vars
- ProjectConfig.env_vars
- ProjectConfig.partial_parse
- RenderConfig.env_vars
- RenderConfig.exclude
- RenderConfig.select
- RenderConfig.selector
The following argument was introduced in case users would like to define Airflow variables that could be used to refresh the cache (it expects a list with Airflow variable names):
- RenderConfig.airflow_vars_to_purge_cache
Example:
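The refresh rules in this section can be sketched in a few lines. This is only an illustration under stated assumptions, not the code merged in this PR: hash_dir and cache_version are hypothetical helper names, the DAG configuration is a plain dict, and the Airflow variable values are passed in directly instead of being read from Airflow.

```python
from __future__ import annotations

import hashlib
import json
import os
import tempfile
from pathlib import Path


def hash_dir(project_dir: Path) -> str:
    """Return a SHA256 digest over the relative path and content of every file."""
    digest = hashlib.sha256()
    # Sort the files so the digest does not depend on directory listing order.
    for path in sorted(p for p in project_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(project_dir).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def cache_version(project_dir: Path, config: dict, airflow_vars: dict) -> str:
    """Hash the project digest together with config and Airflow variable values."""
    payload = json.dumps(
        {"project": hash_dir(project_dir), "config": config, "airflow_vars": airflow_vars},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    model = project / "models" / "customers.sql"
    model.parent.mkdir()
    model.write_text("select 1 as id")
    config = {"select": ["tag:daily"], "exclude": []}  # hypothetical values

    before = cache_version(project, config, {"refresh_cache": "1"})

    # Changing only the modified timestamp leaves the version untouched.
    os.utime(model, (1_000_000_000, 1_000_000_000))
    same = cache_version(project, config, {"refresh_cache": "1"})

    # Changing file content or a watched variable value yields a new version.
    model.write_text("select 2 as id")
    after_edit = cache_version(project, config, {"refresh_cache": "1"})
    after_var = cache_version(project, config, {"refresh_cache": "2"})

    print(before == same, before != after_edit, after_edit != after_var)
```

Hashing content rather than timestamps is what keeps the version stable across deployments that rewrite file metadata.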
Cache key
The Airflow variables that represent the dbt ls cache are prefixed by cosmos_cache. When using DbtDag, the keys use the DAG name. When using DbtTaskGroup, they consider the TaskGroup and parent task groups and DAG.
Examples:
- DbtDag "cosmos_dag" will have the cache represented by "cosmos_cache__basic_cosmos_dag".
- DbtTaskGroup "customers" declared inside the DAG "basic_cosmos_task_group" will have the cache key "cosmos_cache__basic_cosmos_task_group__customers".
Cache value
The cache values contain a few properties:
- last_modified timestamp, represented using the ISO 8601 format.
- version is a hash that represents the version of the dbt project and arguments used to run dbt ls by the time the cache was created.
- dbt_ls_compressed represents the dbt ls output compressed using zlib and encoded to base64 to be recorded as a string to the Airflow metadata database.
Steps used to compress:
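The compression steps, together with the key and value layout described above, can be sketched as follows. This is a minimal illustration and not necessarily the merged implementation: cache_key, compress, decompress and cache_value are hypothetical names, and joining nested task groups with a double underscore is an assumption extrapolated from the two key examples.

```python
import base64
import zlib
from datetime import datetime, timezone


def cache_key(dag_id, task_group_ids=()):
    """Build a key such as cosmos_cache__<dag>__<task group>."""
    return "__".join(["cosmos_cache", dag_id, *task_group_ids])


def compress(dbt_ls_output: str) -> str:
    """Compress with zlib, then base64-encode so the result is a plain string."""
    compressed = zlib.compress(dbt_ls_output.encode("utf-8"))
    return base64.b64encode(compressed).decode("utf-8")


def decompress(value: str) -> str:
    """Reverse of compress: base64-decode, then zlib-decompress."""
    return zlib.decompress(base64.b64decode(value.encode("utf-8"))).decode("utf-8")


def cache_value(dbt_ls_output: str, version: str) -> dict:
    """Assemble the three properties stored in the Airflow Variable."""
    return {
        "last_modified": datetime.now(timezone.utc).isoformat(),  # ISO 8601
        "version": version,
        "dbt_ls_compressed": compress(dbt_ls_output),
    }


key = cache_key("basic_cosmos_task_group", ["customers"])
value = cache_value('{"name": "customers", "resource_type": "model"}', version="abc123")
print(key)
print(decompress(value["dbt_ls_compressed"]))
```

Base64 adds roughly a third to the compressed size, but it keeps the value a plain string that can be stored in the metadata database.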
We are compressing this value because it will be significant for larger dbt projects, depending on the selectors used, and we wanted this approach to be safe and not clutter the Airflow metadata database.
Some numbers on the compression
zlib and encoded using base64 - to 6% of the original size.
The latency used to compress is in the order of milliseconds, not interfering in the performance of this solution.
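A rough way to check size and latency on your own project is sketched below. The input is synthetic: the field names and the jaffle_shop identifiers are illustrative stand-ins for real dbt ls output, so the ratio printed here will not match the figure quoted above, and measure is a hypothetical helper.

```python
import base64
import json
import time
import zlib


def synthetic_dbt_ls(n_models: int) -> str:
    """Generate JSON lines that loosely resemble dbt ls output (illustrative only)."""
    lines = [
        json.dumps(
            {
                "name": f"model_{i}",
                "resource_type": "model",
                "unique_id": f"model.jaffle_shop.model_{i}",
                "depends_on": {"nodes": [f"model.jaffle_shop.model_{i - 1}"] if i else []},
                "tags": [],
            }
        )
        for i in range(n_models)
    ]
    return "\n".join(lines)


def measure(text: str):
    """Return (raw bytes, stored bytes, elapsed milliseconds) for zlib + base64."""
    raw = text.encode("utf-8")
    start = time.perf_counter()
    stored = base64.b64encode(zlib.compress(raw))
    elapsed_ms = (time.perf_counter() - start) * 1000
    return len(raw), len(stored), elapsed_ms


raw_size, stored_size, elapsed_ms = measure(synthetic_dbt_ls(500))
print(f"raw={raw_size}B stored={stored_size}B ratio={stored_size / raw_size:.1%} time={elapsed_ms:.2f}ms")
```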
Future work
- LoadMode.DBT_LS cache #1090
- ObjectStorage? [Feature] Allow storing dbt ls cache into Object Store #1072
Example of results before and after this change
Task queue times in Astro before the change:

Task queue times in Astro after the change on the second run of the DAG:

This feature is available in astronomer-cosmos==1.5.0a8.
The previous screenshots were taken when trying out the alpha release using the following Astro CLI project:
https://github.com/astronomer/cosmos-demo
The same was reproduced by running the DAG using Airflow standalone.